Abstract

Feeding training data to statistical representa- 
tions of language has become a popular past- 
time for computational linguists, but our un- 
derstanding of what constitutes a sufficient 
volume of data remains shadowy. For ex- 
ample, Brown et al. (1992) used over 500 
million words of text to train their language 
model. Is this enough? Could devouring even 
more data further improve the accuracy of the 
model learnt? In this paper I explore a number 
of issues in the analysis of data requirements 
for statistical NLP systems.^1 A framework for
viewing such systems is proposed and a sam- 
ple of existing works are compared within this 
framework. Finally, the first steps toward a 
theory of data requirements are made by es- 
tablishing an upper bound on the expected er- 
ror rate of a class of statistical language learn- 
ers as a function of the volume of training 
data. 



1 Introduction 

Statistical approaches to natural language 
processing are becoming increasingly popular, 
being applied to a wide variety of tasks. For 
example, Weischedel et al. (1993) explore
part-of-speech tagging, parsing and acquisi- 
tion of lexical frames. Nonetheless, all these 
tasks share some important characteristics, 
not the least of which is the requirement for a 
sizable corpus of training data. One question 
which has largely been ignored is how much 



^1 This paper has been published in the Proceedings
of the Second Conference of the Pacific Association for
Computational Linguistics, Brisbane, Australia, 1995.



data is enough? For example, given a limited 
body of training data, it is essential to know 
which statistical NLP methods are likely to be 
accurate before pursuing any one. Also, given 
a particular method, when will acquiring fur- 
ther training data cease to improve the system 
accuracy? Currently, the field is conspicuously 
lacking a general theory of data requirements 
for statistical NLP. 

In this paper, I present the first steps to- 
wards the development of such a theory. I 
begin by formulating a framework for statis- 
tical NLP systems designed to capture some 
of the elements crucial to data requirements 
analysis. I will then review a sample of exist- 
ing approaches, showing how they fit into the 
framework, and where they vary from it. Even 
though several of these introduce complexities 
which are not captured by the framework, rea- 
soning in the framework still supports some 
important insights into these systems. Finally, 
I present some preliminary work on establish- 
ing a closed form upper bound on data require- 
ments for a class of statistical NLP systems. 
This latter work owes much to Mark Johnson, 
Brown University, who is responsible for several
key mathematical ideas in Section 4.

Statistical NLP systems are designed to 
make choices; hopefully in an informed man- 
ner. To do this they use indicators, upon 
which their choices are conditioned. The pur- 
pose of computing statistics is to inductively 
establish the relationship between the indica- 
tors and the choices to be made. Consider for 
example a next word predictor which attempts 
to predict the next word on the basis of the 
preceding word. To do this it must have an 
understanding of the relationship between the 
indicator (the preceding word) and the choice
(the next word). It is possible to acquire this
understanding by computing statistics over a 
large corpus, a process called training. Once 
trained, the system may then be applied to 
a new text and its accuracy evaluated. This 
paper is concerned with the dependence of a 
system's accuracy on the size of the training 
corpus. In the following section, the notions 
of indicators, choices and training data will be 
made more formal. 

2 Statistical Processors 

2.1 A Framework 

A statistical NLP system deals with a certain 
linguistic universe. Formally, there is a set of 
linguistic events $\Omega$ from which every training
example and every test instance will be drawn. 
In the next word predictor, this need only be 
the set of all pairs of words which may be ad- 
jacent in text. 

Let $V$ be a finite set of values that we would
like to assign to a given linguistic input. This
defines the range of possible answers that the
analyser can make. In the predictor, this is
the set of words plus an end of sentence symbol.
Let $J : \Omega \to V$ be the random variable
describing the distribution of values that
linguistic events take on. We also require a set
of indicators, $B$, to use in selecting a value for
a given linguistic event. I will refer to each
element of $B$ as a bin. In the predictor, the set
of bins is the set of words plus a start of
sentence symbol. Let $I : \Omega \to B$ be the random
variable describing the distribution of bins into
which linguistic events fall.

The task of the analyser is to choose which 
value is most likely given only the indicator. 
Therefore, it is a function $A : B \to V$. The
task of the learning algorithm is to acquire this 
function by computing statistics on the train- 
ing set. 

Putting these components together, we can 
define a Statistical Processor, $S$, as a tuple
$\langle \Omega, B, V, A \rangle$, where:

• $\Omega$ is the set of all possible linguistic events

• B and V are finite sets, the bins and val- 
ues respectively 

• A is the trained analysis function 



Amongst all such statistical processors,
there is a special class in which we are
interested. Define a probabilistic analyser to
be a statistical processor which computes a
function $p : B \times V \to [0, 1]$ such that
$\sum_{v \in V} p(b, v) = 1$ for all $b \in B$ and then
computes $A$ as:

\[ A(b) = \operatorname{argmax}_{v \in V} p(b, v) \tag{1} \]

The problem of acquiring A is thus transformed
into one of estimating the function p
using the training corpus. Generally, $p(b, v)$
is viewed as an estimate of the probability
$\Pr(J = v \mid I = b)$.

2.2 Training Data 

Formally, a training corpus, c, of m instances,
is an element of $(B \times V)^m$ where each pair
$(b, v)$ is sampled according to the random
variables $I$ and $J$ from $\Omega$. For probabilistic
analysers, there are a variety of methods by which
an appropriate function p can be estimated 
from a corpus; one simple example being the 
Maximum Likelihood Estimate. Regardless 
of the learning algorithm used, each possible 
training corpus, c, results in the acquisition of 
some function, $p_c$. Our aim is to explore the
dependence of the expected accuracy of $p_c$ on
the magnitude of m. 
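
As an illustration of these definitions, the following Python sketch trains a probabilistic analyser for the next word predictor using the Maximum Likelihood Estimate. It is a minimal sketch only: the names `train_mle` and `analyse` and the toy sentence are assumptions introduced here, not part of any system discussed in this paper.

```python
from collections import Counter, defaultdict

def train_mle(corpus):
    """Estimate p(b, v) by relative frequency within each bin b."""
    counts = defaultdict(Counter)
    for b, v in corpus:
        counts[b][v] += 1
    return {b: {v: n / sum(vs.values()) for v, n in vs.items()}
            for b, vs in counts.items()}

def analyse(p, b):
    """A(b) = argmax_v p(b, v); returns None if bin b was never seen."""
    if b not in p:
        return None
    return max(p[b], key=p[b].get)

# Toy next word predictor: bin = preceding word, value = next word.
words = "<s> the cat sat on the mat </s>".split()
corpus = list(zip(words, words[1:]))   # m = 7 training instances
p_c = train_mle(corpus)
print(analyse(p_c, "cat"))             # sat
```

Each possible corpus yields its own `p_c`; the question pursued below is how the expected accuracy of the resulting analyser varies with the number of instances.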

Surprisingly, it is not always obvious how 
many training instances have been used to 
train a statistical method. It is not generally
sufficient to report the size of the corpus
in words. A system which collects word
associations using a window of cooccurrence 
10 words wide will find 819 instances in a 
100 word corpus, while one collecting the ob- 
jects of the preposition on from the same cor- 
pus, would most likely find only a few in- 
stances. Therefore, before any conclusions can 
be drawn about data requirements, the train- 
ing corpus must be measured in terms of in- 
stances. 
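
The figure of 819 can be reproduced under one particular reading of the windowing scheme, which is an assumption made here for illustration: only full windows are counted, and the first word of each window is paired with each of the other words in it. The helper name `window_instances` is hypothetical.

```python
def window_instances(n_words, width):
    """Count (anchor, neighbour) pairs, one per neighbour in each full window."""
    n_windows = n_words - width + 1          # full windows only
    return max(n_windows, 0) * (width - 1)   # anchor paired with the rest

print(window_instances(100, 10))  # 819
```

Other windowing conventions give somewhat different counts, which is a further reason to report corpus size in instances rather than words.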

Each of these instances falls into a particular
bin by virtue of its associated indicator.
In choosing the indicators, we have implicitly 
defined equivalence classes for instances. The 
statistical processor will treat every instance 
in a bin identically. Further, once the bins 
are chosen, the greater the number of training 
instances that fall into a bin, the greater our 
confidence in the statistical inference made by
the processor for test cases in that bin. For
instance, the next word predictor is more likely
to be correct when the preceding word is com- 
mon than when it is a rare word. 

It is not always obvious how many bins a 
given statistical method employs. Often mul- 
tiple indicators are used. For instance, a tri- 
gram tagger uses the tags of the two preceding 
words and the current word to choose a new 
tag. In this case, B = T x T x W where T is 
the tagset and W is the vocabulary. 

This example demonstrates an important 
point. By choosing to take into account the 
tags of two preceding words, the trigram tag- 
ger requires |T| times as many bins as a bigram 
tagger (where B = T x W). With more bins, 
the trigram tagger is sensitive to a broader 
range of context and thus can in principle 
achieve a greater accuracy. However, because 
there are more bins, there are fewer training 
instances in each bin. Thus, statistical esti- 
mation will be less accurate. In practice, high 
accuracy requires at least a few training in- 
stances per bin. Thus increasing the number 
of indicators may actually decrease the overall 
accuracy. 

For probabilistic analysers it is useful to
define the number of slots, L, to be $|B|(|V| - 1)$,
which is the number of independent parameters
needed to define the function p.
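
The bin and slot counts above can be checked with a few lines of Python. This is a sketch under stated assumptions: the slot formula is taken to be L = |B|(|V| - 1), the tagset size of 47 is the one reported for the tagger reviewed in Section 3.1, and the vocabulary size `W` is a hypothetical value chosen only for illustration.

```python
def slots(n_bins, n_values):
    """L = |B| (|V| - 1): free parameters of p, one distribution per bin."""
    return n_bins * (n_values - 1)

T = 47          # tagset size reported for Weischedel et al. (1993)
W = 50_000      # hypothetical vocabulary size, for illustration only

bigram_bins = T * W          # B = T x W
trigram_bins = T * T * W     # B = T x T x W
print(trigram_bins // bigram_bins)   # 47, i.e. |T| times as many bins
print(slots(T * T, T))               # 101614, i.e. 47 x 47 x 46
```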

2.3 Error Rates and Optimality 

For any non-trivial general statistical proces- 
sor the indicators used cannot perfectly rep- 
resent the entire linguistic event space. Thus, 
in general there exist values $v_1, v_2 \in V$ for
which both $\Pr(J = v_1 \mid I = b) > 0$ and
$\Pr(J = v_2 \mid I = b) > 0$ for some $b \in B$. Suppose
without loss of generality that $A(b) = v_1$.
The analyser will be in error with probability
at least $\Pr(J = v_2, I = b)$. This is the root
of a rather difficult problem in statistical NLP
because no matter how inaccurate a trained
statistical processor is, the inaccuracy may be
due to the imperfect representation of $\Omega$ by $B$.^2

Probabilistic analysers always select just
one value for each bin, the one which maximises
p. Let $v_{mode}(p, b) = \operatorname{argmax}_{v \in V} p(b, v)$.
This leads to a definition for the expected
error rate of a function p, R(p):

\[ R(p) = \sum_{b \in B} \Pr(I = b) \Bigl( \sum_{v \in V \setminus \{v_{mode}(p, b)\}} \Pr(J = v \mid I = b) \Bigr) \tag{2} \]

^2 Unless a more accurate statistical processor based
on the same indicators already exists.

This is the probability of the analyser being
in error on a randomly selected element of
$\Omega$. Let $p_{opt}$ be any function which minimises
the expected error rate and $r_{opt} = R(p_{opt})$.
Given B and V, $r_{opt}$ is the smallest possible
expected error rate. Any probabilistic analyser
which achieves an error rate close to this is
unlikely to benefit from further training data.
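
To make the two quantities concrete, the sketch below computes the expected error rate of an estimated function p and the optimal error rate for a small, entirely hypothetical joint distribution over two bins and two values. The function names and the numbers are assumptions made for illustration; in practice the true distribution is unknown.

```python
def expected_error(joint, p):
    """R(p): probability mass on values other than the analyser's choice.

    joint[b][v] = Pr(I = b, J = v); p[b][v] = estimated p(b, v).
    """
    err = 0.0
    for b, dist in joint.items():
        choice = max(p[b], key=p[b].get)          # v_mode(p, b)
        err += sum(pr for v, pr in dist.items() if v != choice)
    return err

def optimal_error(joint):
    """r_opt: error when every bin picks its truly most likely value."""
    return sum(sum(d.values()) - max(d.values()) for d in joint.values())

# Hypothetical true distribution and a (partly wrong) estimate of it.
joint = {"x": {"a": 0.4, "b": 0.1}, "y": {"a": 0.2, "b": 0.3}}
p = {"x": {"a": 0.9, "b": 0.1}, "y": {"a": 0.6, "b": 0.4}}
print(expected_error(joint, p), optimal_error(joint))
```

Here the estimate picks the wrong value in bin "y", so its error rate (about 0.4) exceeds the optimum (about 0.3); no amount of further training can push the error below the latter.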

Unless large volumes of manually annotated 
data exist, measuring the size of $r_{opt}$ in any
given statistical processor presents a difficult 
challenge. Hindle and Rooth (1993) have at- 
tempted a similar task using human subjects 
on the problem of prepositional phrase attach- 
ment. Subjects were given only the prepo- 
sition and the preceding verb and noun and 
then were asked to select the attachment. This 
was precisely the task facing their statistical 
processor. The subjects could only perform 
the attachment correctly in around 86% of 
cases. If we assume that the subjects incor- 
rectly analysed the remaining 14% of cases 
because these cases depended on knowledge of 
the wider context, then any statistical learning 
algorithm based only on these indicators can- 
not do better than 86%. Of course, if there is 
insufficient training data the system may do 
considerably worse. 

Assuming that human performance on the
task accurately reflects the value of $r_{opt}$ is the
only means known at present to estimate the
value of $r_{opt}$. Unfortunately, this approach
is expensive to apply and makes a number of 
questionable psychological assumptions. For 
example, it assumes that humans can accu- 
rately reproduce parts of their language anal- 
ysis behaviour on command. It may also suffer 
when representational aspects of the analysis 
task cannot be explained easily to experimen- 
tal subjects. A worthwhile goal for future re- 
search is to establish a statistical method for 
estimating or bounding $r_{opt}$ using language
data. 

3 Statistical Learning

3.1 Existing Methods

In this section, I show how a number of exist- 
ing statistical NLP works fit into the frame- 
work, including a tagger, a sense disambigua- 
tor and three syntactic analysers. For each, I 
consider how the various elements of the gen- 
eral statistical processor are instantiated. 

Weischedel et al. (1993) use (among other
experiments) a trigram hidden Markov model 
to tag text for part of speech. The training 
data is four million words of the University of 
Pennsylvania Treebank, tagged with a set of 
47 different tags. I shall regard B as consisting
of the two previous tags (T x T), while V
is simply the tagset. The system also takes 
into account lexical tag frequencies (that is, 
B = T x T x W). I will assume however
that data sparseness does not affect the lex- 
ical tag frequency estimates. Since the tri- 
gram estimates and the lexical tag frequencies 
are combined as independent factors, ignor- 
ing the lexical component does not seem un- 
reasonable. The situation is further compli- 
cated because probability is maximised over 
a sequence of words, rather than for a single 
word. The framework needs to be extended 
to capture these mechanisms, but for the mo- 
ment the approximations I have made may be 
useful. Since every word in the corpus (bar the 
first two) is used for training, we have m = 4
million and L = 47 x 47 x 46. The accuracy is 
reported to be around 97%, which is approx- 
imately the accuracy of human taggers using 
the whole context. 

Yarowsky (1992) describes a sense disam- 
biguation system which uses a 100 word win- 
dow of cooccurrences. He uses a mutual 
information-like measure which combines the 
cooccurrence statistics for all words in each 
category of Roget's thesaurus. The result is 
a profile of contexts for a category which can 
be used to estimate how likely each category 
is within a certain context. Comparing the 
different possible categories for the word pro- 
vides a broad sense discrimination. The train- 
ing corpus is Grolier's encyclopedia which con- 
tains on the order of 10 million words. Each 
of these provides 100 training instances (every
other word in the window), so $m \approx 1$
billion. Since the evidence from each word 
in the context is combined independently, it 
is reasonable to regard B as simply the set 
of distinct words in Grolier's. Again, further 



work is needed to make this approximation un- 
necessary. V is the set of Roget's categories 
($|V| = 1042$), so assuming the vocabulary is
around 100,000, $L \approx 100$ million.^3 The average
accuracy reported is 92%.

Hindle and Rooth (1993) propose a sys- 
tem to syntactically disambiguate preposi- 
tional phrase attachments. Unambiguous ex- 
amples of attachments are used to find lex- 
ical associations (a likelihood ratio) between 
prepositions and the nouns or verbs they at- 
tach to. They cyclically apply this technique, 
adding disambiguated attachments into the 
training set, until all the training data (am- 
biguous or not) has been used. This approach 
can be approximated by a probabilistic anal- 
yser. Each association value is ascribed to a 
pair (w, p) where w is a verb or noun and p
is a preposition. Thus B is a product of two
indicator spaces: the set of verbs and nouns 
and the set of prepositions. Assuming they 
used 10,000 nouns and verbs (5,000 of each) 
and 100 prepositions, $|B| = 1$ million. The
analyser computes a probability for each of 
two possible attachments, nominal and verbal, 
so V is binary. The training set consists of 
754,000 noun attachments and 468,000
verb ones, giving m = 1.22 million.^4 The
accuracy reported is close to 80%, while hu- 
man subjects given the same indications could 
achieve 85-88% accuracy. If the latter figure 
reflects the optimal error rate, it appears there 
is still room for improvement by adding train- 
ing data or changing the statistical measures. 

Lauer (1994) describes a system for syntactically
analysing compound nouns. Two-word compounds
extracted from Grolier's encyclopedia were used
to measure mutual information between every
pair of thesaurus categories (using Roget's
thesaurus) and the results used to select a
bracketing for three-word
compounds. Since an association value is com- 
puted for every pair of thesaurus categories, 
$|B|$ is equal to 1043 x 1043. There are only
two possible bracketings to choose from, so 
again V is binary. The training corpus consists 
of about 35,000 two-word compounds, giving 
m = 35,000 and $L \approx 1$ million. The accuracy
reported is 75%. 

^3 Some stemming is performed, so it is the number
of stems in the vocabulary that we want.

^4 I have allowed all the training examples as
instances, even though some are acquired by cyclic
refinement.

Resnik and Hearst (1993) aim to enhance
Hindle and Rooth's (1993) work by incorpo- 
rating information about the head noun of the 
prepositional phrase in question. Thus B is 
now a product of three spaces: the set of nouns 
and verbs, the set of prepositions and the set 
of nouns. To reduce the data requirement, a 
freely available on-line thesaurus, called
WordNet, is used (Beckwith et al., 1991). WordNet
groups words into synsets, categories of syn- 
onymous words. These synsets are arranged 
in a taxonomy, so that every word is also pro- 
vided with a list of hypernyms. The system 
then adds together the frequency counts for 
nouns within a synset, providing more data 
about each. This reduces the number of bins, 
since it is the synsets which are taken as in- 
dicators rather than individual words. If we 
assume roughly 1000 synsets, 1000 verbs and 
100 prepositions, then $|B| = 200$ million. V is
still binary. Their training corpus is an "or- 
der of magnitude smaller than" Hindle and 
Rooth's, so m is around 100,000. Unlike Hin- 
dle and Rooth's, their corpus is parsed, which 
should give better results. Interestingly, they 
combine evidence from large groups of synsets 
within WordNet's hypernym hierarchy using
a t-test. This causes the effective number of 
synsets for nouns to be reduced, perhaps by as 
much as a factor of 10 (thus $|B| \approx 20$ million).
I will therefore assume that $L \approx 20$ million.
Even given the additional information about 
the head noun of the prepositional phrase, the 
accuracy reported fails to improve on that of 
Hindle and Rooth, being 78%. It is possible 
that insufficient training data is the cause of 
this shortfall. 

Table 1 shows a summary of the above systems,
ordered on the ratio m : L. A strong
correlation is evident between the value of this 
ratio and the success rate. This suggests that 
the success of a statistically based system is 
strongly dependent on the confidence permit- 
ted by the training set size as measured by this 
ratio. 
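
The ratios can be recomputed directly from the estimates of m and L given in the text above. The sketch below does so; the dictionary layout and variable names are assumptions made here, and the figures are the rough estimates already stated for each system, so the results are order-of-magnitude values only.

```python
systems = {
    # name: (m, L) as estimated in the text
    "Weischedel et al.": (4_000_000, 47 * 47 * 46),
    "Yarowsky": (1_000_000_000, 100_000 * (1042 - 1)),
    "Hindle & Rooth": (1_220_000, 1_000_000),
    "Lauer": (35_000, 1043 * 1043),
    "Resnik & Hearst": (100_000, 20_000_000),
}
ratios = {name: m / L for name, (m, L) in systems.items()}
for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} m:L = {r:.3g}")
```

The computed ratios span roughly four orders of magnitude, from about 40 down to about 0.005, in the same order as the systems are listed above.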

3.2 An Important Trade-off 

The model formulated above and the empiri- 
cal data presented support a number of qual- 

^5 Brackets indicate figures measured on different data
and/or under different conditions.
^6 Reported in Resnik (1993).
^7 Reported in Dras and Lauer (1993).



itative inferences about the potential of sys- 
tems given a fixed training set size. Because 
training data will always be limited, such rea- 
soning is an important part of system design. 
Therefore before turning to some quantitative 
analyses, I will examine a few such inferences. 

The most important of these is in regard to 
linguistic sophistication, that is the degree to 
which the system uses knowledge of the pat- 
terns of language. This kind of knowledge 
is extremely important, since it often allows 
just the right distinctions to be made. More 
simplistic systems will inevitably assign one 
choice to two different inputs because their 
linguistic knowledge fails to support a distinc- 
tion. Therefore, it seems desirable to incorpo- 
rate as much linguistic sophistication as possi- 
ble. While this is a tempting direction to take 
for improving system performance, there is a 
barrier. 

Consider, for instance, the effect on data 
requirements of incorporating new indicators. 
Each indicator increases the number of dis- 
tinctions which the system can make. For ex- 
ample, Resnik and Hearst (1993) take into ac- 
count the object of the preposition. In doing 
so, they distinguish cases which Hindle and 
Rooth (1993) did not. As a result, the number 
of cases their system considers is substantially 
larger than those considered by Hindle and 
Rooth's. In terms of the framework, Resnik 
and Hearst have many more bins than Hindle 
and Rooth. 

It is easy to see that incorporating a new 
indicator increases the number of bins combi- 
natorially. The size of B is multiplied by the 
range of the new indicator. This results in the 
ratio m : L falling by the same factor, which, 
as I have argued above, can be detrimental to 
the overall accuracy. 

The situation is worse still if the training 
set is not hand annotated. In this case, in- 
troducing the new indicator creates additional 
ambiguity in the training set since the value of 
the new indicator must be determined for each 
training example. This effectively decreases 
the number of training instances resulting in 
a further decrease in m : L. 

Thus, linguistic sophistication presents a 
trade-off between accuracy and data sparse- 
ness. It is a balance between poor modeling 
of the language and insufficient data for ac- 
curate statistics. If we are to strike a satis- 
factory compromise, we need a strong theory 

System             Training Source        m     L      m : L    Accuracy  Humans
Weischedel et al.  Manual Supervision     4M    100k   40       97%       (< 97%)^5
Yarowsky           Unsupervised           1G    100M   10       92%
Hindle & Rooth     Automatic Supervision  1.2M  ~1M    ~1       80%       85-88%
Lauer              Automatic Supervision  35k   1M     0.035    75%       (< 80%)^6
Resnik & Hearst    Manual Supervision     100k  >20M   <0.005   78%       (< 92%)^7

Table 1: Summary of a sample of statistical NLP systems



of data requirements and ways to make more 
economic use of data. 

One such method is termed conceptual as- 
sociation, as defined in Resnik and Hearst 
(1993). By collecting statistics based on con- 
cepts rather than individual words, the num- 
ber of bins is usually reduced. The idea is to 
generalise findings about words to cover other 
words which have the same meaning. The ad- 
vantages of this approach are extensively ar- 
gued in Resnik (1993) and the method is used 
in Lauer (1994). While concepts can help, 
the ambiguity introduced (namely what con- 
cept does a given word belong to) may under- 
mine the increased accuracy. Further work is 
needed to establish the effects on data require- 
ments of employing this strategy. 

A novel extension to this approach that 
has not yet been employed, would be to col- 
lect statistics at various levels of granular- 
ity. Statistics computed on counts of individ- 
ual words would provide fine sensitivity, while 
statistics computed on counts of a small set of 
semantic primitives (such as ANIMATE, AB- 
STRACT, etc.) would provide the coarsest 
evidence. As many levels as desired between 
these two extremes could be employed in this 
way. The level used to make each choice could 
then be selected according to the degree of 
confidence available at each level. If insuf- 
ficient data has been seen to allow a confi- 
dent selection at one level, a coarser grained 
level would be tried. Resnik and Hearst (1993) 
seem to be simulating this when they perform 
a t-test across all levels of the WordNet hier- 
archy. 
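
One way such a multi-level scheme might be organised is sketched below. Everything here is hypothetical: the function `choose_level`, the use of a simple instance-count threshold `min_count` as the measure of confidence, and the toy counts are assumptions made for illustration, and no existing system is claimed to work this way.

```python
def choose_level(counts_by_level, bin_keys, min_count=3):
    """Pick the finest level whose bin holds at least min_count instances.

    counts_by_level: list of {bin: {value: count}}, finest level first.
    bin_keys: the test instance's bin at each level, in the same order.
    Returns (level index, chosen value), or (None, None) if no level qualifies.
    """
    for level, (counts, key) in enumerate(zip(counts_by_level, bin_keys)):
        freq = counts.get(key, {})
        if sum(freq.values()) >= min_count:
            return level, max(freq, key=freq.get)
    return None, None

word_level = {"dog": {"N": 1}}                    # too sparse to trust
primitive_level = {"ANIMATE": {"N": 4, "V": 1}}   # coarser but better attested
print(choose_level([word_level, primitive_level], ["dog", "ANIMATE"]))
```

In this toy case the word-level bin holds a single instance, so the choice falls back to the coarser ANIMATE level.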



4 First Steps Towards a Theory



4.1 A Simple Learning Scheme 

In this section I shall establish some lower 
bounds on the accuracy of a simple training 
scheme within the framework developed. The 



mathematics presented in Sections 4.2 through 
4.3 was for the most part developed by Mark
Johnson of Brown University and completed 
by the author. I wish to thank him for his 
many communications in this regard. 

Let $t(b, c) = \{(b, v) \mid v \in V, (b, v) \in c\}$, the
training instances in a corpus c that fall into
bin b.

Let $f(v, t) = |\{(b, v) \mid b \in B, (b, v) \in t\}|$, the
frequency of the value v in the set of training
instances, t.

Let $mode(b, c) = \operatorname{argmax}_{v \in V} f(v, t(b, c))$,
the most common value for instances from a
corpus c in a bin b. Where several values have
equal frequencies, one should be chosen at random.

Define the learning algorithm such that:

\[ p_c(b, v) = \begin{cases} 1 & \text{if } v = mode(b, c) \\ 0 & \text{otherwise} \end{cases} \]



Since each bin has only one value with non- 
zero probability, V is effectively a binary set 
(either the instance has the non-zero value or 
it does not). Thus, $L = |B|$. Notice also that
the value assigned highest probability by $p_c$ is
the one most frequently falling into the bin.
That is, $v_{mode}(p_c, b) = mode(b, c)$.

Two possible cases arise when the analyser 
is faced with making a decision on the basis of 
some indication, b. Either the corpus contains 
no occurrences of $(b, v)$ for any value $v \in V$
(Case A) or there is some training data which 
falls into the bin (Case B). 
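
The learning scheme just defined is simple enough to state in a few lines of Python. The following is a sketch rather than a reference implementation: the names `train_mode` and `p_c` and the toy corpus are assumptions introduced here, and an empty bin (Case A) is represented simply by the bin being absent from the model.

```python
import random
from collections import Counter, defaultdict

def train_mode(corpus, rng=random):
    """Return {bin: mode value}; ties are broken at random."""
    by_bin = defaultdict(Counter)
    for b, v in corpus:
        by_bin[b][v] += 1
    model = {}
    for b, freq in by_bin.items():
        top = max(freq.values())
        model[b] = rng.choice(sorted(v for v, n in freq.items() if n == top))
    return model

def p_c(model, b, v):
    """p_c(b, v) = 1 if v = mode(b, c), else 0; always 0 for an unseen bin."""
    return 1 if model.get(b) == v else 0

corpus = [("on", "V"), ("on", "V"), ("on", "N"), ("of", "N")]
model = train_mode(corpus)
print(model["on"], "with" in model)   # V False
```

The bin "with" never occurs in the toy corpus, so a test instance falling into it is an instance of Case A.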

4.2 Empty Bins 

Case A arises when none of the training
instances fall into the bin. Let $p_b$ denote
$\Pr(I = b)$. The probability of bin b being
empty after training on a randomly selected
corpus is $(1 - p_b)^m$. Thus the probability over
all test inputs of there being no training data
in the bin for that input is:

\[ w = \sum_{b \in B} p_b (1 - p_b)^m \]

Since we know that $\sum_{b \in B} p_b = 1$, it is possible
to show that the maximum for w occurs when
$\forall b \in B,\ p_b = \frac{1}{|B|}$. Therefore:

\[ w \le \Bigl(1 - \frac{1}{|B|}\Bigr)^m \le e^{-m/|B|} \]

So for even quite small values of $m/|B|$, the
probability that any given test sample falls
into a bin for which we received no training
samples is very low. For example, when
$m/|B| \ge 3$, such empty bins occur in less than
5% of inputs.

4.3 Non-empty Bins 

In Case B, we have at least one instance in the
corpus for the given bin. Let $n = |t(b, c)| \ge 1$.
An optimal function, $p_{opt}$, will be one which
chooses for bin b the value v that maximises
$\Pr(J = v \mid I = b)$. Let
$v_{opt}(b) = \operatorname{argmax}_{v \in V} \Pr(J = v \mid I = b)$,
the most likely value in bin b. Let
$q(b) = \Pr(J = v_{opt}(b) \mid I = b)$,
the probability of this value given an instance
in bin b. Notice from equation (2)
that the expected error rate is minimised when
$\forall b \in B,\ v_{mode}(p, b) = v_{opt}(b)$. Therefore:

\[ \begin{aligned} r_{opt} &= \sum_{b \in B} \Pr(I = b) \Bigl( \sum_{v \in V \setminus \{v_{opt}(b)\}} \Pr(J = v \mid I = b) \Bigr) \\ &= \sum_{b \in B} \Pr(I = b) \Pr(J \ne v_{opt}(b) \mid I = b) \\ &= \sum_{b \in B} \Pr(I = b) (1 - q(b)) \end{aligned} \]

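Since $r_{opt}$ is the best possible error rate,
it follows that q(b) must be high for most
bins if the system is to work at all. Therefore,
$v_{opt}(b)$ should be a frequent value in
each bin. Now if more than half of the instances
in a bin have the value $v_{opt}(b)$, then
this must be the most common value in the
bin. Thus, if $f(v_{opt}(b), t(b, c)) > \frac{n}{2}$, then
$mode(b, c) = v_{opt}(b)$. By computing the
probability of $f(v_{opt}(b), t(b, c)) > \frac{n}{2}$ we can
obtain a lower bound for the accuracy on bin
b.^8

\[ \begin{aligned} \Pr(v_{opt}(b) = v_{mode}(p_c, b) \mid I = b) &= \Pr(mode(b, c) = v_{opt}(b) \mid I = b) \\ &\ge \Pr\bigl(f(v_{opt}(b), t(b, c)) > \tfrac{n}{2} \mid I = b\bigr) \\ &= \sum_{i = \frac{n+1}{2}}^{n} \binom{n}{i} q(b)^i (1 - q(b))^{n-i} \end{aligned} \]

Thus, since $\Pr(J = v_{mode}(p_c, b) \mid I = b) \ge \Pr(J = v_{opt}(b) \mid I = b) \Pr(v_{opt}(b) = v_{mode}(p_c, b) \mid I = b)$:

\[ \begin{aligned} \sum_{v \in V \setminus \{v_{mode}(p_c, b)\}} \Pr(J = v \mid I = b) &\le 1 - q(b) \sum_{i = \frac{n+1}{2}}^{n} \binom{n}{i} (1 - q(b))^{n-i} q(b)^i \\ &= U_n(b) \end{aligned} \]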
This is an upper bound on the expected error
rate for bin b.^9 So:

\[ R(p_c) \le \sum_{b \in B} \Pr(I = b) U_n(b) \]

As noted above, $r_{opt} = \sum_{b \in B} \Pr(I = b)(1 - q(b))$.
A comparison between the upper bound
$U_n(b)$ and the optimal error rate $1 - q(b)$ shows
that for reasonably high values of q(b),
$U_n(b)$ is close to $1 - q(b)$. For example, when
$q(b) \ge 0.9$, $U_3(b) \le 1.26(1 - q(b))$ and
$U_5(b) \le 1.08(1 - q(b))$.

^8 The argument shown holds for all odd n. A variation
of the argument that bounds the expected accuracy
for all even n is simple to construct.

^9 For odd n. When n is even it can be shown that
$U_{n-1}(b)$ is an upper bound.
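
The factors quoted above can be checked numerically. The sketch below assumes that, for odd n, the bound has the form U_n(b) = 1 - q(b) * Pr(more than half of the n instances in the bin carry the optimal value), with the instances drawn independently; the function name `upper_bound` is an assumption introduced here.

```python
from math import comb

def upper_bound(n, q):
    """U_n(b) for odd n: 1 - q * Pr(more than half of n draws equal v_opt)."""
    majority = sum(comb(n, i) * q**i * (1 - q)**(n - i)
                   for i in range((n + 1) // 2, n + 1))
    return 1 - q * majority

# Ratio of the bound to the optimal error rate 1 - q at q = 0.9.
for n in (1, 3, 5):
    print(n, upper_bound(n, 0.9) / (1 - 0.9))
```

At q(b) = 0.9 the ratio falls from just under 2 with a single instance to below 1.26 with three instances and below 1.08 with five, in agreement with the figures in the text.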



In fact, we can derive a bound for any n as
follows:

\[ \begin{aligned} U_n(b) &\le U_1(b) \\ &= 1 - q(b) q(b) \\ &\le 2(1 - q(b)) \\ &= 2 \Pr(J \ne v_{opt}(b) \mid I = b) \end{aligned} \tag{3} \]

Thus in all bins which have training in- 
stances in the corpus, the expected error rate 
for the bin never exceeds twice the optimal 
error rate for that bin. It is interesting to 
note that Hindle and Rooth's (1993) system 
has roughly one instance per bin and an op- 
timal error rate of 12% (assuming the human 
accuracy of 88% is optimal), so that equation
(3) predicts a lower bound of 77% accuracy.
This is just less than the 80% observed. 
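
The 77% figure can be checked by hand. The working below assumes, purely for illustration, that every bin holds exactly one training instance and that q(b) = 0.88 in every bin, so that the single-instance bound U_1(b) = 1 - q(b)q(b) applies throughout.

```latex
\begin{align*}
q(b) &= 0.88 \\
U_1(b) &= 1 - q(b)^2 = 1 - 0.7744 = 0.2256 \\
\text{accuracy} &\ge 1 - U_1(b) = 0.7744 \approx 77\%
\end{align*}
```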

When 5 instances from the training corpus 
fall into the bin, the expected error rate ap- 
proaches the optimal error rate closely and 
when there are an average of 3 instances per 
bin, very few bins do not have instances from 
the training corpus. So, in general it appears 
that 3-5 instances per bin will be sufficient. 

4.4 Skewed Bins 

An obvious question is why systems such as 
Lauer (1994) and Resnik and Hearst (1993) 
work at all given that far less than one in- 
stance is expected for each slot. One possible 
answer is that different bins have widely dif- 
fering frequencies. The system quickly learns 
about the most frequent cases at the expense 
of less frequent ones. 

This can be modeled by considering the 
different distributions, $p_b$, defined for Case A
above. In that analysis, the probability of en- 
countering an empty bin was maximised over 
all possible distributions. However, if some- 
thing is known about the distribution, in prin- 
ciple a tighter bound is possible. For exam- 
ple, suppose some fraction of the bins have 
very low probability. That is, $\exists B' \subset B$ such
that $\sum_{b \in B'} p_b = c$ for some small c. Let
B" = B\ B' and /3 = Now: 



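\[ \begin{aligned} w &= \sum_{b \in B'} p_b (1 - p_b)^m + \sum_{b \in B''} p_b (1 - p_b)^m \\ &\le \sum_{b \in B'} p_b + \sum_{b \in B''} p_b (1 - p_b)^m \\ &\le c + \sum_{b \in B''} p_b (1 - p_b)^m \end{aligned} \]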

Now the second term is maximised when V6 £ 
B"Pt = 0\ = j0]-So letting A = T^: 



< c + e" 



V/3clS| 



beB 



^10 I wish to thank Eugene Charniak for pointing out
this fact.



So knowing a pair of values c and /3 is a use- 
ful, if primitive, means of lowering the upper 
bound on data requirements. Since the distri- 
bution of bins does not depend on the values 
we are seeking to learn, it should be possible to 
develop simple techniques for estimating values
of c and $\beta$.

4.5 Future Work 

A great deal of work remains to be done. I 
will mention only a few directions where the 
work begs to be extended. First, the mathe- 
matical model doesn't capture several aspects 
of existing models, such as maximising prob- 
abilities over sequences of words and combin- 
ing evidence from multiple sources. Second, 
the simple learning algorithm presented differs 
from those used in practice in several ways. 
It would be useful to explore the relationship 
between the algorithm I have proposed and 
others in existing statistical methods (for ex- 
ample, the backing off method in Katz, 1987). 
Third, smoothing is frequently used to alleviate
data sparseness (see Dagan et al., 1993),
but the model does not include any means to 
represent the process of smoothing. Finally, 
almost all statistical NLP systems deal with 
some noise in the training data. This is es- 
pecially important in systems like Yarowsky 
(1992) where training is unsupervised. The 
mathematical results need to be extended to 
reflect noisy training data and to support rea- 
soning about the sensitivity of data require- 
ments to noise. 
5 Conclusion

In this paper I have indicated the lack of a 
general theory of data requirements within the 
field of statistical NLP. As a first step in the 
development of such a theory I have presented 
a framework for statistical NLP systems. I 
have shown how several prominent works in 
the field fit this model and demonstrated a 
number of mathematical results which support 
inferences about data requirements. I believe 
this represents a significant first step along the 
road to a better understanding of when and 
how statistical NLP methods can be applied. 